Overview

• Why use the GPU? 

• Highly parallel workload 

• Free your CPU to do game code 

• Leverage compute 
Overview

• Emit 

• Simulate 

• Sort 

• Rendering 

• Rasterization or Tiled Rendering 
Data Structures

Particle Pool

Position, Velocity, Age, Color, etc.

Sort List

uint index; float distanceSq;

Dead List

uint index;
Simulate Compute Shader

Update particles. Add alive ones to the Sort List, add dead ones to the Dead List.

• Particle Pool: RWStructuredBuffer<>

• Dead List (uint index;): AppendStructuredBuffer<uint>, written with g_DeadList.Append( index );

• Sort List (uint index; float distanceSq;): RWStructuredBuffer<float2>, slot reserved with g_SortList.IncrementCounter();
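The simulate step can be sketched on the CPU as follows. This is a minimal Python illustration, not the shader from the talk: the `Particle` fields, the meaning of `age` (remaining lifetime) and the use of squared distance from the view-space origin are assumptions made for the example. Plain lists stand in for the Dead List and Sort List buffers.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    position: tuple  # view-space (x, y, z); assumption for this sketch
    velocity: tuple
    age: float       # remaining lifetime in seconds; assumption for this sketch


def simulate(pool, dt, dead_list, sort_list):
    """Advance every live particle; route its index to the dead or sort list."""
    for index, p in enumerate(pool):
        if p.age <= 0.0:
            continue  # slot already dead; its index is assumed to be in the dead list
        p.age -= dt
        if p.age <= 0.0:
            dead_list.append(index)  # stands in for g_DeadList.Append( index )
            continue
        p.position = tuple(x + v * dt for x, v in zip(p.position, p.velocity))
        distance_sq = sum(x * x for x in p.position)
        sort_list.append((index, distance_sq))  # (uint index, float distanceSq)


pool = [
    Particle((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), 1.0),   # stays alive
    Particle((0.0, 0.0, 2.0), (0.0, 0.0, 0.0), 0.05),  # dies this frame
    Particle((0.0, 0.0, 3.0), (0.0, 0.0, 0.0), 0.0),   # already dead
]
dead_list, sort_list = [], []
simulate(pool, 0.1, dead_list, sort_list)
```

On the GPU each loop iteration would be one thread, and the two appends would be atomic counter operations on the buffers.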
Collisions

• Primitives

• Heightfield

• Voxel data

• Depth buffer

[Tchou11]
Depth Buffer Collisions

• Project particle into screen space

• Read Z from depth buffer

• Compare view-space particle position vs view-space position of Z buffer value

• Use thickness value
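The test above can be sketched in Python. Several things here are assumptions for illustration: the depth buffer is taken to already hold linear view-space depth (a real one stores post-projection Z that has to be converted first), view-space Z is positive in front of the camera, and the projection is a simple pinhole with a hypothetical `focal_length`.

```python
def project_to_pixel(view_pos, focal_length, width, height):
    """Project a view-space position to integer pixel coordinates."""
    x, y, z = view_pos
    ndc_x = focal_length * x / z
    ndc_y = focal_length * y / z
    px = int((ndc_x * 0.5 + 0.5) * width)
    py = int((0.5 - ndc_y * 0.5) * height)
    return px, py


def depth_buffer_collision(view_pos, depth_buffer, focal_length, thickness):
    """True if the particle lies inside the slab just behind the visible surface."""
    if view_pos[2] <= 0.0:
        return False  # behind the camera
    height, width = len(depth_buffer), len(depth_buffer[0])
    px, py = project_to_pixel(view_pos, focal_length, width, height)
    if not (0 <= px < width and 0 <= py < height):
        return False  # off screen: no depth information available
    surface_z = depth_buffer[py][px]
    # The thickness value stops particles far behind a surface from colliding.
    return surface_z < view_pos[2] < surface_z + thickness


depth_buffer = [[5.0] * 4 for _ in range(4)]  # flat wall 5 units away
```

The thickness value is what lets a particle that passes well behind an object keep moving instead of snapping to the front surface.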
Depth Buffer Collision Response

• Use normal from G-buffer

• Or take multiple taps of the depth buffer

• Watch out for depth discontinuities
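A minimal Python sketch of the multi-tap option, assuming the three taps have already been reconstructed into view-space positions. The `max_depth_step` rejection threshold and the `restitution` factor are hypothetical parameters for this example, not values from the talk.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normal_from_taps(center, right, down, max_depth_step=1.0):
    """Estimate a normal from three neighbouring view-space positions.

    Returns None when a neighbour's depth jumps by more than max_depth_step,
    which suggests the taps straddle a depth discontinuity.
    """
    if (abs(right[2] - center[2]) > max_depth_step
            or abs(down[2] - center[2]) > max_depth_step):
        return None
    n = cross(sub(right, center), sub(down, center))
    length = math.sqrt(dot(n, n))
    if length == 0.0:
        return None
    return tuple(c / length for c in n)


def reflect(velocity, normal, restitution=1.0):
    """Bounce the velocity off the surface; the sign of the normal does not matter."""
    d = dot(velocity, normal)
    return tuple(v - (1.0 + restitution) * d * n for v, n in zip(velocity, normal))


normal = normal_from_taps((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
bounced = reflect((1.0, 0.0, 2.0), normal)
```

When the taps are rejected, falling back to the G-buffer normal or skipping the response for that frame are both plausible choices.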
Sort

• Sort for correct alpha blending

• Additive blending just saturates the effect

• Bitonic sort parallelizes well on GPU
Bitonic Sort

7 | 3 | 6 | 8 | 1 | 4 | 2 | 5

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 )
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 )
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
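The loop structure above can be run on the CPU as a reference. The sketch below is a Python version under two assumptions: the array length is a power of two, and `selectBitonic` is interpreted as a compare-and-swap whose direction depends on which sub-array the element sits in. The inner `for` is the part a GPU would run with one thread per element.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list in place, ascending.

    Returns the (subArraySize, compareDist) pair of every pass executed.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    passes = []
    sub_array_size = 2
    while sub_array_size <= n:
        compare_dist = sub_array_size // 2
        while compare_dist > 0:
            passes.append((sub_array_size, compare_dist))
            # GPU part of the sort: every pair (i, partner) is independent.
            for i in range(n):
                partner = i ^ compare_dist
                if partner > i:  # handle each pair once
                    ascending = (i & sub_array_size) == 0
                    if ascending:
                        out_of_order = values[i] > values[partner]
                    else:
                        out_of_order = values[i] < values[partner]
                    if out_of_order:
                        values[i], values[partner] = values[partner], values[i]
            compare_dist //= 2
        sub_array_size *= 2
    return passes


data = [7, 3, 6, 8, 1, 4, 2, 5]
passes = bitonic_sort(data)
```

For the eight-element array this runs six passes, one per (subArraySize, compareDist) combination.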
Bitonic Sort (Pass 1)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 2
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 1
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
Bitonic Sort (Pass 2)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 4
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 2
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
Bitonic Sort (Pass 3)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 4
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 1
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
Bitonic Sort (Pass 4)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 8
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 4
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}








GAME DEVELOPERS CONFERENCE 2014

MARCH 17-21, 2014 GDCONF.COM


Bitonic Sort (Pass 5)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 8
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 2
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
Bitonic Sort (Pass 6)

for ( subArraySize=2 ; subArraySize<=ArraySize ; subArraySize*=2 ) // subArraySize == 8
{
    for ( compareDist=subArraySize/2 ; compareDist>0 ; compareDist/=2 ) // compareDist == 1
    {
        // Begin: GPU part of the sort
        for each element n
            n = selectBitonic( n, n ^ compareDist );
        // End: GPU part of the sort
    }
}
Pixel Shader

Texturing and tinting. Depth fade for soft particles. 
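A depth fade for soft particles is commonly computed from the gap between the scene depth and the particle depth at the pixel. The Python sketch below shows one such formulation; `fade_distance` is a hypothetical tuning parameter, and both depths are assumed to be linear view-space values.

```python
def saturate(x):
    """Clamp to [0, 1], like the HLSL intrinsic of the same name."""
    return max(0.0, min(1.0, x))


def soft_particle_fade(scene_depth, particle_depth, fade_distance):
    """Alpha multiplier that falls to 0 as the particle approaches the scene surface."""
    return saturate((scene_depth - particle_depth) / fade_distance)


half_faded = soft_particle_fade(10.0, 9.5, 1.0)
```

The particle's alpha is multiplied by this factor, which hides the hard edge where a billboard intersects opaque geometry.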
Rasterization

• DrawIndexedInstancedIndirect() or DrawInstancedIndirect()

• VertexId = particle index (or VertexId/4 for VS billboarding)

• 1 instance 

• Heavy overdraw on large particles - restricts game design 

• Fit polygon billboard around texture [Persson09] 

• Render to half size buffer [Cantlay07] 

• Sorting issues 

• Loss of fidelity 
Tiled Rendering

• Inspired by Forward+ [Harada12]

• Screen-space binning of particles instead of lights 

• Per-tile 

• Cull & Sort 

• Per pixel/thread 

• Evaluate color of each particle 

• Blend together 

• Composite back onto scene 
Tiled particle culling
Thread Group View

[numthreads(32, 32, 1)]

Culling 1024 particles in parallel

Write visible indices to LDS
Per Tile Bitonic Sort

• Because each thread adds a visible particle

• Particles are added to LDS in arbitrary order

• Need to sort

• Only sorting particles in tile rather than global list
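The per-tile cull and sort can be sketched on the CPU as below. This Python example makes several simplifying assumptions: particles are given as screen-space circles with a view depth, the visibility test is a 2D circle-versus-rectangle check rather than a frustum test, and Python's built-in sort stands in for the bitonic sort performed in LDS.

```python
def cull_and_sort_tile(particles, tile_min, tile_max):
    """Return indices of particles touching the tile, ordered back to front.

    particles: list of (screen_x, screen_y, radius, view_depth) tuples.
    tile_min, tile_max: (x, y) pixel corners of the tile.
    """
    visible = []
    for index, (x, y, radius, _depth) in enumerate(particles):
        # Closest point on the tile rectangle to the particle centre.
        cx = min(max(x, tile_min[0]), tile_max[0])
        cy = min(max(y, tile_min[1]), tile_max[1])
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius * radius:
            visible.append(index)  # on the GPU: written to LDS in arbitrary order
    # Only this tile's particles are sorted, not the global list.
    visible.sort(key=lambda i: particles[i][3], reverse=True)
    return visible


particles = [
    (8.0, 8.0, 2.0, 5.0),      # inside the tile
    (100.0, 100.0, 3.0, 1.0),  # far outside
    (17.0, 8.0, 2.0, 9.0),     # overlaps the right edge
    (4.0, 4.0, 1.0, 2.0),      # inside the tile
]
order = cull_and_sort_tile(particles, (0.0, 0.0), (16.0, 16.0))
```

Reversing the sort key gives the front-to-back order instead.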
Tiled Rendering (1 thread = 1 pixel)

• Set accum color to float4( 0, 0, 0, 0 ) 

• For each particle in tile (back to front) 

• Evaluate particle contribution 

• Radius check 

• Texture lookup 

• Optional normal generation and lighting 

• Manually blend 

• color = ( srcA x srcCol ) + ( invSrcA x destCol ) 

• alpha = srcA + ( invSrcA x destA ) 

• Write to screen size UAV 
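The manual blend above can be checked with a small Python sketch of what one pixel's thread does. The per-particle contribution (radius check, texture lookup, lighting) is assumed to have been evaluated already and is passed in as a colour and alpha.

```python
def blend_back_to_front(particles):
    """Accumulate (rgb, alpha) contributions ordered back to front.

    Returns (r, g, b, a) for one pixel.
    """
    dest_col = (0.0, 0.0, 0.0)
    dest_a = 0.0
    for src_col, src_a in particles:
        inv_src_a = 1.0 - src_a
        # color = ( srcA x srcCol ) + ( invSrcA x destCol )
        dest_col = tuple(src_a * s + inv_src_a * d for s, d in zip(src_col, dest_col))
        # alpha = srcA + ( invSrcA x destA )
        dest_a = src_a + inv_src_a * dest_a
    return dest_col + (dest_a,)


# Red particle at the back, blue particle in front, both half transparent.
pixel = blend_back_to_front([((1.0, 0.0, 0.0), 0.5), ((0.0, 0.0, 1.0), 0.5)])
```

Every particle in the tile has to be visited in this order, because a later (nearer) particle can still change the result.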
Tiled Rendering, improved!

• Set accum color to float4( 0, 0, 0, 0 ) 

• For each particle in tile (front to back) 

• Evaluate particle contribution 

• Manually blend [Bavoil08] 

• color = ( invDestA x srcA x srcCol ) + destCol 

• alpha = srcA + ( invSrcA x destA ) 

• if ( accum alpha > threshold ) 

accum alpha = 1 and bail 

• Write to screen size UAV 
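A Python sketch of the front-to-back variant for one pixel follows. Contributions are assumed to be pre-evaluated (colour, alpha) pairs, and the default `threshold` of 0.99 is an arbitrary value chosen for this example.

```python
def blend_front_to_back(particles, threshold=0.99):
    """Accumulate (rgb, alpha) contributions ordered front to back.

    Stops early once the accumulated alpha exceeds the threshold.
    """
    dest_col = (0.0, 0.0, 0.0)
    dest_a = 0.0
    for src_col, src_a in particles:
        inv_dest_a = 1.0 - dest_a  # uses the alpha accumulated so far
        # color = ( invDestA x srcA x srcCol ) + destCol
        dest_col = tuple(inv_dest_a * src_a * s + d for s, d in zip(src_col, dest_col))
        # alpha = srcA + ( invSrcA x destA )
        dest_a = src_a + (1.0 - src_a) * dest_a
        if dest_a > threshold:
            dest_a = 1.0
            break  # particles further back cannot be seen
    return dest_col + (dest_a,)


# Blue particle in front, red particle behind, both half transparent.
pixel = blend_front_to_back([((0.0, 0.0, 1.0), 0.5), ((1.0, 0.0, 0.0), 0.5)])
# An opaque particle in front hides everything behind it.
occluded = blend_front_to_back([((1.0, 1.0, 1.0), 1.0), ((1.0, 0.0, 0.0), 0.5)])
```

For the same two half-transparent particles this produces the same pixel as blending them back to front, while the early out skips work in heavily overdrawn pixels.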
Coarse Culling

• Bin particles into 8x8

• UAV0 for indices

• Array split into sections using offsets 

• UAV1 for storing particle count per bin 

• 1 element per bin 

• Use InterlockedAdd() to bump counter 

• For each alive particle 

• For each bin 

• Test particle against bin's frustum planes 

• Bump counter in UAV1 to get slot to write to 

• Add particle index to UAV0
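The binning scheme can be sketched in Python as below. Assumptions for the example: "8x8" is read as an 8x8 grid of screen bins, each bin owns a fixed-size section of the index array (`max_per_bin` is a hypothetical capacity), and a 2D bounding-box overlap test replaces the frustum plane test. Two plain lists stand in for UAV0 and UAV1.

```python
def coarse_cull(particles, screen_w, screen_h, bins_x=8, bins_y=8, max_per_bin=4):
    """Bin particles given as (screen_x, screen_y, radius) tuples.

    Returns (indices, counts): indices is one flat array split into per-bin
    sections by offset; counts holds one element per bin.
    """
    num_bins = bins_x * bins_y
    indices = [-1] * (num_bins * max_per_bin)  # stands in for UAV0
    counts = [0] * num_bins                    # stands in for UAV1
    bin_w = screen_w / bins_x
    bin_h = screen_h / bins_y
    for index, (x, y, radius) in enumerate(particles):
        for b in range(num_bins):
            x0 = (b % bins_x) * bin_w
            y0 = (b // bins_x) * bin_h
            if (x + radius < x0 or x - radius > x0 + bin_w
                    or y + radius < y0 or y - radius > y0 + bin_h):
                continue  # particle does not touch this bin
            slot = counts[b]  # InterlockedAdd() would return this old value
            counts[b] += 1
            if slot < max_per_bin:
                indices[b * max_per_bin + slot] = index
    return indices, counts


indices, counts = coarse_cull(
    [(5.0, 5.0, 1.0), (10.0, 5.0, 2.0), (75.0, 75.0, 1.0)], 80.0, 80.0)
```

The counter doubles as the slot allocator, so many threads can write into the same bin without overwriting each other.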
Demo



Demo with full source available soon 
Performance Results

Mode             Frame time (ms)*
Rasterization    4.86
Tiled            3.15

Breakdown        Frame time (ms)*
Simulation       0.39
Coarse Culling   0.06
Tile Culling     0.43
Render           1.60

*AMD Radeon R9 290X @ 1080p
Performance Results

Mode             Frame time (ms)*
Rasterization    25.0
Tiled            5.1

*R9 290X @ 1080p
Conclusions

• Leverage compute for particle simulations 

• Depth buffer collisions 

• Bitonic sort for correct blending 

• Tiled rendering 

• Faster than rasterization 

• Great for combating heavy overdraw 

• More predictable behavior 

• Future work 

• Volume tracing 

• Add arbitrary geometry for OIT 
Questions?

Demo with full source available soon

http://developer.amd.com/tools/graphics-development/amd-radeon-sdk/